Abstract 

Much of the current research concerning reliability emphatically suggests that 
researchers gather their own reliability estimates when administering an instrument. It has also 
been recommended that data with low reliability be discarded. Although some data obtained 
from instruments that originally yielded reliable results may be unreliable, it does not necessarily 
follow that the data are of no use to researchers. This paper contends that although homogeneous 
data might yield less reliable results than did an inducted sample, these data should not 
be discarded until they have been examined further. Methods for determining data 
homogeneity are discussed in detail. 

Alternative Approaches for Interpreting Alpha with Homogeneous Subsamples 

Much of the current literature concerning reliability emphatically suggests that 
researchers obtain their own reliability estimates when gathering new data from a previously 
developed instrument (Vacha-Haase, Kogan, & Thompson, in press). Moreover, Thompson and 
Vacha-Haase (2000) encouraged researchers not to “induct” reliability estimates for their own 
dataset from previous studies, but to obtain reliability estimates for their own dataset, because 
reliability estimates are affected by individual sample characteristics. Concerning this practice 
of “inducting” reliability estimates, Pedhazur and Schmelkin (1991) stated: 

[Reliability estimates printed in test manuals] may be useful for comparative 
purposes, but it is imperative to recognize that the relevant reliability estimate is 
the one obtained for the sample used in the [current] study under consideration. 
(p. 86) 

Similarly, Dawis (1987) contended that although the type of instrument utilized may influence 
reliability, reliability also might be influenced by sample composition and sample variability. 
Therefore, it is imperative that researchers compute and interpret reliability coefficients for their 
underlying sample. 

Noting the problems with interpreting alpha as a consistent measure of reliability, 
Pedhazur and Schmelkin (1991) stated, “coefficient alpha will underestimate the reliability of a 
measure when its items are not at least essentially tau-equivalent” (p. 100). Pedhazur and 
Schmelkin (1991) then give a brief discussion of the problems associated with interpreting alpha 
with differing test lengths (number of items), with restricting time allotted to administer the test, 
and when instrument homogeneity (unidimensionality) is high. The last of these problems, 
homogeneity, is the focus of the present essay. 

Coefficient Alpha 

For the purposes of this paper, we will use the formula for coefficient alpha that was 
originally developed by Cronbach (1951). For reference purposes, the formula for computing 
alpha is 

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum \sigma_k^2}{\sigma^2}\right) \qquad (1)$$

where k is the number of items, $\sum \sigma_k^2$ is the sum of the individual test item variances, and 
$\sigma^2$ is the total test variance. Although typically KR-20 is used with dichotomously-scored items, it 
should be noted that alpha equals KR-20 because each kth $p_k q_k$ equals each kth $\sigma_k^2$, and across all 
k items $\sum p_k q_k = \sum \sigma_k^2$ (Thompson, 1999). Consulting Equation 1, a large alpha would be yielded 
from scores that necessarily had a small sum of individual item variances and a large total test 
variance. Likewise, a dataset yielding a small alpha coefficient would be produced by scores 
that had high individual item variances and a small total test variance. It should also be noted 
that although conceptually alpha represents a squared metric, it is mathematically possible to 
obtain an alpha value less than zero. This occurs when the sum of the individual item variances 
exceeds the total test variance. 
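As a minimal sketch of Equation 1, the following Python function computes alpha from a matrix of item scores. The function name and both small datasets are hypothetical and serve only to illustrate the arithmetic; variances are computed with N rather than N - 1, in keeping with the note to Table 3.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Coefficient alpha for a list of examinee rows, each holding k item scores.

    Variances use the population divisor (N), not N - 1.
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha requires at least two items")
    items = list(zip(*scores))                       # one tuple per item
    sum_item_var = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total test variance is zero; alpha is undefined")
    return (k / (k - 1)) * (1 - sum_item_var / total_var)


# Hypothetical data: every examinee answers all three items the same way,
# so the total test variance is large relative to the item variances.
consistent = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]

# Hypothetical data: item responses are unrelated across items, so the sum
# of the item variances (.75) exceeds the total test variance (.25).
scrambled = [[1, 0, 0], [0, 1, 0], [0, 0, 1],
             [1, 1, 0], [0, 1, 1], [1, 0, 1]]

print(cronbach_alpha(consistent))   # close to 1
print(cronbach_alpha(scrambled))    # negative
```

The second dataset illustrates the point made above: when the sum of the item variances exceeds the total test variance, the coefficient falls below zero.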

Consider the following heuristic examples of datasets that would yield a low reliability 
estimate. The first example is a measure on which all of the students simply guessed at 
responses to items, so that the sum of the individual item variances was as large as or even larger 
than the total test variance. This might be the case if an instrument like the Scholastic Assessment 
Test were given to second graders. The second example is when there is a lack of variability among 
examinees. For example, if a second-grade spelling test were administered to college English 
majors, all of the examinees would probably score close to the same value, thus yielding a small 
alpha. 

It has been recommended consistently in the literature that scores yielding low reliability 
estimates either be considered extremely suspect or discarded (Abelson, 1997). Although scores 
that yield low reliability coefficients in many instances indicate poor psychometric properties, it 
does not necessarily follow that the underlying data are always “unuseful” to researchers. 
Consider an example when a depression measure is administered to students who are all 
relatively not depressed, that is, who represent a homogeneous sample with respect to depression 
scores. Such scores likely would yield a low reliability estimate. However, interpreting this 
reliability coefficient without taking into consideration the homogeneous nature of the sample 
would be misleading. Rather, what is needed in such instances is not to discard the data, but to 
examine the scores further to determine why low reliability estimates were obtained from an 
instrument that produced high reliability estimates in the original test sample. In an attempt to 
overcome problems associated with low reliability estimates in homogeneous samples, 
Magnusson (1967) offered a formula based on the sample variance from the original normative 
group (i.e., the inducted sample) and the underlying sample group. Magnusson’s (1967) formula 
can be used to predict the reliability of scores from a present sample, based on the reliability of 
scores from the inducted sample and the standard deviations of both samples. 
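The text above names Magnusson's (1967) formula without printing it. The sketch below uses the form in which this relationship is commonly stated, which assumes that the error variance is the same in the inducted and the underlying groups. The function name and the numerical values are hypothetical and are not taken from Table 3.

```python
def predicted_reliability(r_inducted, sd_inducted, sd_underlying):
    """Predict reliability in the underlying sample from the inducted sample.

    Commonly stated form (an assumption here, since the source does not
    print the formula): r_uu = 1 - s_i^2 * (1 - r_ii) / s_u^2,
    which treats the error variance as equal in both groups.
    """
    if sd_underlying <= 0:
        raise ValueError("underlying standard deviation must be positive")
    error_variance = sd_inducted ** 2 * (1 - r_inducted)
    return 1 - error_variance / sd_underlying ** 2


# Hypothetical values: a manual reports r = .90 with SD = 10, and the
# present, more homogeneous sample has SD = 5.
print(round(predicted_reliability(0.90, 10.0, 5.0), 2))   # 0.6
```

Under these assumed values, halving the standard deviation is enough to lower the predicted coefficient from .90 to .60 without any change in the error variance.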

Heuristic Example 

For the purposes of the present article, two heuristic datasets were utilized that were both 
(hypothetically) derived from the same eight-item instrument. The first dataset was designed 
such that the scores reflected what a homogeneous sample might look like if drawn from a 
population of scores. It should be noted that there are slight variations within each item because 
including data in which all of the examinees had the same score contributes nothing to the 
variability of the data. The second dataset was generated to represent a completely random set of 
scores. The data used are illustrated in Tables 1 and 2. (We would like to bring to the reader’s 
attention the fact that we are not advocating that such small samples be utilized, due to their 
ensuing low statistical power for detecting relationships. These small datasets are used for 
illustrative purposes only.) 



Insert Tables 1 and 2 about here 



Once data were generated, reliability analyses were conducted using the Statistical 
Package for the Social Sciences (SPSS; SPSS Inc., 2000). Results from the reliability analyses 
are provided in Table 3. The second column in Table 3 illustrates the variance and reliability 
estimates for the random data in Table 2. The third column represents the variance and 
reliability estimates for the homogeneous data in Table 1. In the fourth column, a hypothetical 
dataset also has been included to simulate plausible test manual data. We will use this dataset for 
comparison with the random and homogeneous datasets. 



Insert Table 3 about here 



From Table 3, we see that the alpha coefficients from the random, homogeneous, and 
inducted datasets are .097, .071, and .883, respectively. In this example, although the inducted 
scores yielded a high reliability coefficient, scores from the two heuristic datasets yielded low 
reliability estimates. Alpha is low for these two datasets because the sum of the 
item variances is almost as great as the total test variance. In the inducted dataset, the sum of the 
item variances is small relative to the total test variance, thereby yielding a large alpha 
coefficient. 

The mean item variance in Table 3 is the overall mean of the eight item variances for 
each dataset. These values are .2573, .0625, and .1818 for the random, homogeneous, and 
inducted datasets, respectively. We see that the mean item variance effectively discriminates 
the random dataset from the homogeneous dataset. Specifically, whereas the mean item variance 
for the random dataset is larger than that for the inducted sample, the mean item variance for the 
homogeneous dataset is smaller than that for the inducted sample. 

The variance of the mean item variances (Table 3) is simply the squared standard 
deviation ($\sigma^2$) of the mean item variances. These values are .0002, .0000, and .0056 for the 
random, homogeneous, and inducted datasets, respectively. It should be noted that a small 
variance of the mean item variances indicates that the mean item variances are relatively the 
same (e.g., either all small or all large). In each of the three datasets presented here, the variance 
of the mean item variances is small, thus indicating that nearly all of the item variances, within 
each dataset, are clustered relatively close together. Therefore, the variance of the mean item 
variances does not adequately discriminate the random and homogeneous samples. 

Also presented in Table 3 is the ratio of the sum of the individual test item variances to 
the total test variance for the random, homogeneous, and inducted datasets. These values are 
.9132, .9376, and .2274, respectively. This ratio, which is the final term of Equation 1 
above, is the most important component of an alpha coefficient. Interestingly, these ratios do not 
adequately discriminate the random dataset from the homogeneous dataset. 

The final statistic presented in Table 3 is the square of the standard error of measurement. 
The standard error of measurement, or the standard deviation of errors of measurement, provides 
an absolute rather than a relative measure of the extent to which raw and true scores are 
equivalent (Crocker & Algina, 1986). Although the true score can never be known, the standard 
error of measurement can be applied to an individual’s observed score to set “plausible limits” 
for locating the true score. These “plausible limits” provide confidence bands for interpreting an 
obtained score. The smaller the standard error of measurement, the smaller the confidence band, 
and the greater the confidence that the observed score is near the true score. From Table 3, we 
see that the squared standard error of measurement effectively discriminates the random dataset 
from the homogeneous dataset. Specifically, whereas the squared standard error of measurement 
for the random dataset is larger than that for the inducted sample, the squared standard error of 
measurement for the homogeneous dataset is smaller than that for the inducted sample. 
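To make the “plausible limits” concrete, the sketch below uses the standard definition of the squared standard error of measurement, namely the total test variance multiplied by one minus the reliability estimate. The function names, the 1.96 multiplier for an approximate 95% band, and all numerical values are assumptions chosen for illustration; they are not the Table 3 values.

```python
import math


def squared_sem(total_variance, reliability):
    """Squared standard error of measurement: s_x^2 * (1 - r_xx)."""
    return total_variance * (1.0 - reliability)


def plausible_limits(observed, total_variance, reliability, z=1.96):
    """Band of observed score +/- z standard errors of measurement."""
    sem = math.sqrt(squared_sem(total_variance, reliability))
    return observed - z * sem, observed + z * sem


# Hypothetical values: total test variance 4.0 and reliability .75.
sem2 = squared_sem(4.0, 0.75)
low, high = plausible_limits(20.0, 4.0, 0.75)
print(sem2, round(low, 2), round(high, 2))
```

Because the statistic depends on the total test variance as well as on the reliability estimate, a sample with a low alpha can still have a small standard error of measurement when its scores vary little.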

Detecting Homogeneous Samples: An Heuristic Example 

The purpose of this heuristic example is to illustrate how researchers might identify 
homogeneous datasets when examining scores with low reliability estimates. A useful identifier 
of data homogeneity from Table 3 is the mean item variances. By comparing the mean item 
variance of a given sample to the mean item variance of the inducted sample, researchers can 
determine whether or not a given sample is more or less homogeneous than is a previous sample, 
based on the magnitude of difference between the two mean item variances. 

Although specific rules for interpretation have not yet been developed, it would stand to 
reason that if the mean item variance for an observed sample were considerably smaller than the 
mean item variance reported in the test manual, then a researcher should examine the 
underlying data more closely in an effort to understand the differences in the alpha coefficients. 
In particular, the difference in mean item variances can be expressed as a proportion to yield 
what we term a “relative mean item variance index.” The relative mean item variance index is 
computed as 

$$\frac{M_{\sigma^2_i} - M_{\sigma^2_u}}{M_{\sigma^2_i}} \qquad (2)$$

where $M_{\sigma^2_i}$ is the mean item variance for the inducted sample and $M_{\sigma^2_u}$ is the mean item 
variance for the underlying sample. This index ranges from $-\infty$ to 1, with positive values 
indicating that the study sample is more homogeneous than the inducted sample, and negative 
values indicating that the study sample is less homogeneous than is the original sample. This 
index can be expressed as a percentage by multiplying the index by 100. For the present sample, 
the random group has a relative mean item variance index of -.4524, whereas the index 
pertaining to the homogeneous dataset is .6454. The negative relative index associated with the 
random dataset suggests that the low reliability index is not the result of homogeneity. 
Conversely, the relative index value pertaining to the homogeneous dataset indicates that the low 
reliability coefficient obtained for the homogeneous dataset is explained to some extent by the 
relatively homogeneous nature of the dataset. In this latter case, it could be argued that the low 
reliability coefficient represents a statistical artifact. As such, whereas a researcher would refrain 
from conducting subsequent analyses using the random dataset, use of the homogeneous dataset 
may be both justified and meaningful. However, it should be noted that it does not necessarily 
follow that if scores are identified as homogeneous then the data always should be used in an 
analysis. In any case, the advantage of the relative mean item variance index is that it can be 
compared across studies. Indeed, this information can be utilized in generalizability studies. For 
example, researchers in generalizability investigations can determine how much of the variance 
in reliability estimates is explained by this index. 
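The relative mean item variance index can be checked against the mean item variances printed in Table 3. In the sketch below, the index is taken to be the inducted value minus the underlying value, divided by the inducted value, which is the form consistent with the stated range and with the reported index values. The function and dictionary names are hypothetical.

```python
def relative_index(inducted, underlying):
    """(inducted - underlying) / inducted; ranges from -infinity to 1."""
    if inducted <= 0:
        raise ValueError("inducted value must be positive")
    return (inducted - underlying) / inducted


# Mean item variances as printed in Table 3.
MEAN_ITEM_VARIANCE = {"random": 0.2408, "homogeneous": 0.0588, "inducted": 0.1658}

for name in ("random", "homogeneous"):
    idx = relative_index(MEAN_ITEM_VARIANCE["inducted"], MEAN_ITEM_VARIANCE[name])
    print(f"{name}: {idx:.4f} ({idx * 100:.1f}%)")
# random: -0.4524 (-45.2%)
# homogeneous: 0.6454 (64.5%)
```

Dividing by the inducted value is what bounds the index above by 1: an underlying sample with no item variance at all would score exactly 1, whereas an underlying sample with very large item variances can score arbitrarily far below zero.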

Although the variance of the mean item variances for the current example cannot be used 
to discriminate the random and homogeneous datasets because of their closeness in value, this 
statistic, alongside the mean item variance, can be utilized to rule out a dataset as 
representing a homogeneous sample. Specifically, whereas, as noted above, the mean 
item variance for the random dataset is relatively large, the variance of the mean item variances 
for the random data is very small (.0002), suggesting that each of the random data items has high 
variance and that the large mean item variance is not due to a specific outlier item. In a case 
like this, researchers probably would be encouraged to discard their data not only because the 
coefficient alpha is small, but also because the mean item variance suggests that the small 
reliability coefficient stems from a complete randomness of responses to the items. 

As noted above, the squared standard error of measurement appears to be particularly useful in 
assessing the level of homogeneity of an underlying sample. From this estimate, we have derived 
a simple index, called the “relative squared standard error of estimate index,” as 

$$\frac{s_i^2 - s_u^2}{s_i^2} \qquad (3)$$

where $s_i^2$ is the squared standard error of measurement from the inducted sample and $s_u^2$ is the 
squared standard error of measurement from the underlying sample. This index also ranges from 
$-\infty$ to 1, with positive values indicating that the study sample is more homogeneous than is the 
inducted sample, and negative values indicating that the study sample is less homogeneous than is 
the inducted sample. For the present heuristic example, the random group has a relative squared 
standard error of estimate index of -1.7140, whereas the index pertaining to the homogeneous 
dataset is .3382. As for the relative mean item variance index, the negative relative squared 
standard error of estimate index associated with the random dataset suggests that the low 
reliability index is not the result of homogeneity. Conversely, the relative squared standard error 
of estimate index value pertaining to the homogeneous dataset indicates that the low reliability 
coefficient obtained for the homogeneous dataset is explained to some extent by the relatively 
homogeneous nature of the dataset. Moreover, the positive relative squared standard error of 
estimate index suggests that the obtained scores are closer to the true scores than is the case for 
the inducted sample, providing further justification for not discarding the homogeneous dataset. 
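The same check can be run for the squared standard errors of measurement printed in Table 3, taking the index as one minus the ratio of the underlying value to the inducted value. The sketch below also attaches the sign-based reading described above. The names are hypothetical, and the sign is used only as a descriptive label because no interpretive thresholds have been developed.

```python
def relative_squared_sem_index(sem2_inducted, sem2_underlying):
    """1 - (underlying squared SEM / inducted squared SEM)."""
    if sem2_inducted <= 0:
        raise ValueError("inducted squared SEM must be positive")
    return 1.0 - sem2_underlying / sem2_inducted


def describe(index):
    """Sign-based reading only; this is not a decision rule."""
    if index > 0:
        return "more homogeneous than the inducted sample"
    if index < 0:
        return "less homogeneous than the inducted sample"
    return "as homogeneous as the inducted sample"


# Squared standard errors of measurement as printed in Table 3.
SEM2 = {"random": 2.0309, "homogeneous": 0.4952, "inducted": 0.7483}

for name in ("random", "homogeneous"):
    idx = relative_squared_sem_index(SEM2["inducted"], SEM2[name])
    print(f"{name}: {idx:.4f}, {describe(idx)}")
# random: -1.7140, less homogeneous than the inducted sample
# homogeneous: 0.3382, more homogeneous than the inducted sample
```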
Real-Life Examples Illustrating When Homogeneity Yields Meaningful and Worthless Low 
Reliability Coefficients 

One of the main concerns posited in this paper is that researchers, when confronted with 
data that yield a low alpha coefficient, should seek to investigate the reason(s) for this low 
reliability estimate. We believe that in the case where the data are homogeneous, it does not 
follow that simply discarding the data is merited. Consider the two following examples of data 
that would yield low reliability estimates. The first example we will refer to as an illustration of 
“bad” homogeneity, and the second example we will refer to as an illustration of “good” 
homogeneity. 

After administering an examination, a researcher discovers that the dataset she is 
investigating has low reliability because an achievement test originally intended for second 
graders had been administered to a group of sixth graders. As a result, the entire cohort of 
students achieved high scores on the examination. In this instance, it could be justifiably 
contended that this instrument yields unreliable data for this specific sample. As has been 
mentioned previously, this does not mean that the instrument is unreliable (instruments are 
neither reliable nor unreliable), but simply that the instrument is inappropriate for use with this 
cohort. This is an example of what is meant by “bad” homogeneity, or homogeneity that should 
consequently lead to a discarding of the data. 

The second example represents a different but equally realistic scenario. Suppose that a 
depression scale has been administered to 45 patients of an outpatient clinic for depression. 
Researchers have administered this instrument because they are concerned with the effectiveness 
of a certain intervention and are measuring the gain scores on the depression scale. In this case, 
it would be expected that the reliability estimates would be very small for the 45 patients who are 
relatively homogeneous because they are in a depression clinic being treated for depression. 
However, simply disposing of the data in this instance seems unwarranted if indeed low 
reliability is due to homogeneity and not to randomness. The goal of administering the 
instrument to these patients is to monitor the effectiveness of the intervention. In research 
studies like this, the effect size will be maximized if and only if participants from one extreme 
of the scale (e.g., clinically-depressed participants) are the focus of the study. Otherwise, ceiling 
effects could confound results, thereby providing rival explanations (i.e., low internal validity). 
Additionally, findings from such an investigation would only be generalizable (i.e., have 
maximal external validity) if a clinically-depressed sample is utilized. This is an example of 
“good” homogeneity, or homogeneity that attenuates the reliability estimate in a manner that is 
directly a function of the level of homogeneity of the sample. 

In the second example, it is extremely likely that the mean item variance was smaller 
than the mean item variance reported in the test manual. If this were the case, the data should 
still be used in the analysis, particularly if the sample size was large, because the low reliability 
estimate is due to individual homogeneity and thus appears acceptable considering the context of 
the study. Moreover, with respect to the homogeneous dataset presented in Table 1, it should be 
noted that if only the reliability coefficient in Table 3 had been reported, and further examination 
of the reliability properties had not been performed, readers of the subsequent final report might 
not have looked favorably on scores that yielded a reliability coefficient of .07. 

Conclusion 

Although the major focus of this paper has been to advocate that researchers spend time 
examining the reasons behind data yielding low reliability estimates, it should be noted that, to 
date, no correction statistics exist for alpha coefficients with homogeneous samples. This can 
and should be the focus of future research in this area. While we encourage the examination of 
the mean item variances, the variance of the mean item variances, and the squared standard error 
of measurement, no specific guidelines have been developed for determining what is and what is 
not an acceptable threshold for data homogeneity. 

However, as illustrated above, the (squared) standard error of measurement provides 
extremely useful information about the degree of homogeneity of an underlying sample. This 
statistic has particular appeal because it is a function of both the standard deviation of scores 
generated from the complete test and the reliability estimate. Thus, the authors currently are 
attempting to develop an index that utilizes the ratio of the squared standard error of 
measurement for the underlying and the inducted samples. 

Because low reliability estimates reduce the statistical power associated with hypothesis 
tests (Onwuegbuzie & Daniel, 2000), researchers should utilize larger samples, when they expect 
that their sample is homogeneous, in order to compensate for the corresponding reliability-based 
loss in statistical power. In fact, increasing the sample size (i.e., a research-based consideration) 
would probably provide a better correction for attenuated relationships than any statistical 
correction of the reliability coefficient itself (i.e., a statistical consideration), just as randomizing 
participants to treatment conditions in an experimental design is superior to using analysis of 
covariance techniques or other statistical adjustments to analyze data stemming from 
non-experimental research designs. 

Nevertheless, encouraging the investigation of coefficients of reliability can only help the 
growth of accountability in test usage. It is our contention that researchers should be encouraged 
not only to compute reliability estimates for their own data, but also to investigate and to explain 
why the reliability estimates differ from the original test manual norming (i.e., inducted) sample. 
Such information would allow readers to put subsequent findings in a more appropriate context. 

Table 1 

Homogeneous dataset 

Table 2 

Random dataset 

Item 1 Item 2 Item 3 Item 4 Item 5 Item 6 Item 7 Item 8 

Table 3 

Item and Test Characteristics for the Two Heuristic Datasets and the “Inducted” Dataset 

Variables                                   Random      Homogeneous   “Inducted” 
                                            Dataset     Dataset       Dataset 
Item 1 variance                             0.2461      0.0588        0.1094 
Item 2 variance                             0.2461      0.0588        0.1875 
Item 3 variance                             0.2500      0.0588        0.2125 
Item 4 variance                             0.2113      0.0588        0.2148 
Item 5 variance                             0.2461      0.0588        0.2461 
Item 6 variance                             0.2461      0.0588        0.1875 
Item 7 variance                             0.2461      0.0588        0.0588 
Item 8 variance                             0.2644      0.0588        0.1094 
Σ item variances                            1.9262      0.4704        1.3260 
Mean item variance                          0.2408      0.0588        0.1658 
Variance of the mean item variances         0.0001      0.0000        0.0037 
Total test variance                         2.1094      0.5000        5.9963 
Σ item variances/total test variance        0.9132      0.9408        0.2211 
Alpha coefficient                           0.0974      0.0714        0.8830 
(Standard error of measurement)²            2.0309      0.4952        0.7483 
Relative mean item variance index          -0.4524      0.6454 
Relative squared standard error of 
  estimate index                           -1.7140      0.3382 

Note. All variance components were computed with the population size (N) and not the 
corrected sample size (N-1). 